Abstract

The objective in extreme multi-label learning is to train a classifier that can automatically tag a novel data point 
with the most relevant subset of labels from an extremely large label set. Embedding based approaches make training 
and prediction tractable by assuming that the training label matrix is low-rank and hence the effective number of 
labels can be reduced by projecting the high dimensional label vectors onto a low dimensional linear subspace. Still, 
leading embedding approaches have been unable to deliver high prediction accuracies or scale to large problems as 
the low rank assumption is violated in most real world applications. 

This paper develops the XI classifier to address both limitations. The main technical contribution in XI is a 
formulation for learning a small ensemble of local distance preserving embeddings which can accurately predict 
infrequently occurring (tail) labels. This allows XI to break free of the traditional low-rank assumption and boost 
classification accuracy by learning embeddings which preserve pairwise distances between only the nearest label 
vectors. 

We conducted extensive experiments on several real-world as well as benchmark data sets and compared our
method against state-of-the-art methods for extreme multi-label classification. Experiments reveal that XI can make
significantly more accurate predictions than the state-of-the-art methods including both embeddings (by as much as
35%) as well as trees (by as much as 6%). XI can also scale efficiently to data sets with a million labels which are
beyond the reach of leading embedding methods.


1 Introduction 

Our objective in this paper is to develop an extreme multi-label classifier, referred to as XI, which can make
significantly more accurate and faster predictions, as well as scale to larger problems, as compared to state-of-the-art
embedding based approaches.

Extreme multi-label classification addresses the problem of learning a classifier that can automatically tag a data 
point with the most relevant subset of labels from a large label set. For instance, there are more than a million labels 
(categories) on Wikipedia and one might wish to build a classifier that annotates a new article or web page with the 
subset of most relevant Wikipedia labels. It should be emphasized that multi-label learning is distinct from multi-class 
classification which aims to predict a single mutually exclusive label. 

Extreme multi-label learning is a challenging research problem as one needs to simultaneously deal with hundreds 
of thousands, or even millions, of labels, features and training points. An obvious baseline is provided by the 1-vs- 
All technique where an independent classifier is learnt per label. Regrettably, this technique is infeasible due to the 
prohibitive training and prediction costs. These problems could be ameliorated if a label hierarchy was provided. 
Unfortunately, such a hierarchy is unavailable in many applications.

Embedding based approaches make training and prediction tractable by reducing the effective number of labels.
Given a set of $n$ training points $\{(x_i, y_i)\}_{i=1}^{n}$ with $d$-dimensional feature vectors $x_i \in \mathbb{R}^d$ and $L$-dimensional label
vectors $y_i \in \{0,1\}^L$, state-of-the-art embedding approaches project the label vectors onto a lower $\hat{L}$-dimensional
linear subspace as $z_i = U y_i$, based on a low-rank assumption. Regressors are then trained to predict $z_i$ as $V x_i$.
Labels for a novel point $x$ are predicted by post-processing $\hat{y} = U^\dagger V x$ where $U^\dagger$ is a decompression matrix which
lifts the embedded label vectors back to the original label space.
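
The compress, regress, decompress pipeline described above can be sketched on synthetic data. This is a minimal illustration, not the method of any particular paper: it assumes an SVD-based choice of $U$ and ridge-regularised least squares for $V$, and the sizes, seed and regularisation value `lam` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, L, L_hat = 200, 20, 50, 5          # toy sizes (assumed for illustration)

X = rng.standard_normal((d, n))          # features, one column per point
Y = (rng.random((L, n)) < 0.05).astype(float)   # sparse binary label matrix

# Compression: rows of U are the top-L_hat left singular vectors of Y.
U_full, _, _ = np.linalg.svd(Y, full_matrices=False)
U = U_full[:, :L_hat].T                  # shape (L_hat, L)
Z = U @ Y                                # z_i = U y_i

# Regression: ridge-regularised least squares so that z_i is close to V x_i.
lam = 1e-2                               # assumed regularisation strength
V = Z @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))   # shape (L_hat, d)

# Decompression: U has orthonormal rows, so its pseudo-inverse is U^T.
x_new = rng.standard_normal(d)
scores = U.T @ (V @ x_new)               # y_hat = U^dagger V x
top5 = np.argsort(-scores)[:5]           # indices of the 5 highest-scoring labels
```

Because $U$ spans only an $\hat{L}$-dimensional subspace, any label direction outside that subspace cannot be recovered by the decompression step, which is the weakness the following paragraphs discuss.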

Embedding methods mainly differ in the choice of compression and decompression techniques such as compressed
sensing [3], Bloom filters, SVD, landmark labels, output codes, etc. The state-of-the-art LEML
algorithm [9] directly optimizes for $U^\dagger$, $V$ using the following objective:

$$\operatorname*{argmin}_{U^\dagger, V} \; \operatorname{Tr}(U^{\dagger\top} U^\dagger) + \operatorname{Tr}(V^\top V) + 2C \sum_{i=1}^{n} \|y_i - U^\dagger V x_i\|^2.$$

Embedding approaches have many advantages including simplicity, ease of implementation, strong theoretical 
foundations, the ability to handle label correlations, the ability to adapt to online and incremental scenarios, etc. 
Consequently, embeddings have proved to be the most popular approach for tackling extreme multi-label problems.

Embedding approaches also have some limitations. They are slow at training and prediction even for a small
embedding dimension $\hat{L}$. For instance, on WikiLSHTC [15, 16], a Wikipedia based challenge data set, LEML with
$\hat{L} = 500$ took 22 hours for training even with early termination while prediction took nearly 300 milliseconds per
test point. In fact, for WikiLSHTC and other text applications with $\hat{d}$-sparse feature vectors, LEML's prediction time
$\Omega(\hat{L}(\hat{d} + L))$ can be an order of magnitude more than even 1-vs-All's prediction time $O(\hat{d} L)$ (as $\hat{d} = 42 \ll \hat{L} = 500$
for WikiLSHTC).

More importantly, the critical assumption made by most embedding methods that the training label matrix is
low-rank is violated in almost all real world applications. Figure 1(a) plots the approximation error in the label matrix as
$\hat{L}$ is varied from 100 to 500 on the WikiLSHTC data set. As can be seen, even with a 500-dimensional subspace the
label matrix still has 90% approximation error. We observe that this limitation arises primarily due to the presence of
hundreds of thousands of "tail" labels (see Figure 1(b)) which occur in at most 5 data points each and, hence, cannot
be well approximated by any linear low dimensional basis.
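
The effect of tail labels on low-rank approximation can be illustrated on a toy matrix. The sketch below computes the relative error $\|Y - Y_k\|_F / \|Y\|_F$ of the best rank-$k$ approximation directly from the singular values of $Y$ (Eckart-Young theorem). The synthetic matrix, in which every label occurs in at most 5 points, is an assumption made for illustration and is not the WikiLSHTC data.

```python
import numpy as np

def low_rank_relative_error(Y, rank):
    """Relative Frobenius error of the best rank-`rank` approximation of Y."""
    s = np.linalg.svd(Y, compute_uv=False)      # singular values, descending
    return np.sqrt(np.sum(s[rank:] ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)
n_labels, n_points = 300, 400

# Toy tail-heavy label matrix: each label is present in 1 to 5 points.
Y = np.zeros((n_labels, n_points))
for j in range(n_labels):
    support = rng.choice(n_points, size=rng.integers(1, 6), replace=False)
    Y[j, support] = 1.0

errors = {r: low_rank_relative_error(Y, r) for r in (10, 50, 100)}
```

On such a matrix the rows are nearly orthogonal, so the spectrum is flat and the error decays slowly as the rank grows, which mirrors the qualitative behaviour reported for the global SVD.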

This paper develops the XI algorithm which extends embedding methods in multiple ways to address these limitations.
First, instead of projecting onto a linear low-rank subspace, XI learns embeddings which non-linearly capture
label correlations by preserving the pairwise distances between only the closest (rather than all) label vectors, i.e.
$d(z_i, z_j) \approx d(y_i, y_j)$ if $i \in \mathrm{kNN}(j)$, where $d$ is a distance metric. Regressors $V$ are trained to predict $z_i = V x_i$.
During prediction, rather than using a decompression matrix, XI uses a k-nearest neighbour (kNN) classifier in the
learnt embedding space, thus leveraging the fact that nearest neighbour distances have been preserved during training.
Thus, for a novel point $x$, the predicted label vector is obtained as $\hat{y} = \sum_{i : V x_i \in \mathrm{kNN}(V x)} y_i$. The use of a kNN
classifier is also motivated by the observation that kNN outperforms discriminative methods in acutely low training
data regimes, as in the case of tail labels.
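
The kNN prediction rule can be written in a few lines. The following is a minimal sketch assuming Euclidean distance in the embedding space and dense arrays; the function name `knn_predict` and its argument layout are hypothetical and chosen only for illustration.

```python
import numpy as np

def knn_predict(V, X_train, Y_train, x, k):
    """Sum the label vectors of the k training points whose embeddings
    V x_i are closest (Euclidean distance, an assumption) to V x."""
    Z_train = V @ X_train                         # embeddings, one column per point
    z = V @ x
    dists = np.linalg.norm(Z_train - z[:, None], axis=0)
    nn = np.argsort(dists)[:k]                    # indices of the k nearest points
    return Y_train[:, nn].sum(axis=1)

# Toy usage: three training points in 2-D, identity regressor.
V = np.eye(2)
X_train = np.array([[0.0, 0.0, 10.0],
                    [0.0, 1.0, 10.0]])
Y_train = np.array([[1, 1, 0],
                    [0, 1, 0],
                    [0, 0, 1]])
y_hat = knn_predict(V, X_train, Y_train, np.array([0.0, 0.2]), k=2)
```

Labels shared by several neighbours receive a higher score, so ranking the entries of the returned vector gives the most relevant labels first.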

The superiority of XI's proposed embeddings over traditional low-rank embeddings can be determined in two
ways. First, as can be seen in Figure 1, the relative approximation error in learning XI's embeddings is significantly
smaller as compared to the low-rank approximation error. Second, XI can improve over state-of-the-art embedding
methods' prediction accuracy by as much as 35% (absolute) on the challenging WikiLSHTC data set. XI also
significantly outperforms methods such as WSABIE [13] which also use kNN classification in the embedding space but
learn their embeddings using the traditional low-rank assumption.

However, kNN classifiers are known to be slow at prediction. XI therefore clusters the training data into $C$
clusters, learns a separate embedding per cluster and performs kNN classification within the test point's cluster alone.
This reduces XI's prediction costs to $O(\hat{d} C + \hat{d} \hat{L} + N_C \hat{L})$ for determining the cluster membership of the test point,
embedding it and then performing kNN classification respectively, where $N_C$ is the number of points in the cluster
to which the test point was assigned. XI can therefore be more than two orders of magnitude faster at prediction
than LEML and other embedding methods on the WikiLSHTC data set where $C = 300$, $N_C \lesssim 13$K, $\hat{L} = 50$ and
$\hat{d} = 42$. Clustering can also reduce XI's training time by almost a factor of $C$. This allows XI to scale to the Ads1M
data set involving a million labels which is beyond the reach of leading embedding methods.
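
As a rough sanity check of the "two orders of magnitude" statement, the operation counts implied by the two cost expressions can be evaluated with the WikiLSHTC figures quoted in this paper (about 320K labels, and $N_C = 13$K taken as an upper bound). Constants hidden by the asymptotic notation are ignored, so the ratio is indicative only.

```python
d_hat = 42            # average number of non-zero features per point
L = 320_000           # number of labels on WikiLSHTC (approx. 320K)
C = 300               # number of clusters
N_C = 13_000          # points in the test point's cluster (upper bound)
L_hat_xi = 50         # XI embedding dimension
L_hat_leml = 500      # LEML embedding dimension

# XI: cluster assignment + embedding + kNN within the cluster.
xi_cost = d_hat * C + d_hat * L_hat_xi + N_C * L_hat_xi

# LEML: embedding followed by decompression to all L labels.
leml_cost = L_hat_leml * (d_hat + L)

ratio = leml_cost / xi_cost
print(xi_cost, leml_cost, round(ratio))
```

Under these assumptions the kNN term $N_C \hat{L}$ dominates XI's cost, while LEML's cost is dominated by the $\hat{L} L$ decompression term.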

Of course, clustering is not a significant technical innovation in itself, and could easily have been applied to
traditional embedding approaches. However, as our results demonstrate, state-of-the-art methods such as LEML do
not benefit much from clustering. Clustered LEML's prediction accuracy continues to lag behind XI's by 14% on
WikiLSHTC and the training time on Ads1M continues to be prohibitively large.
Figure 1: (a) error $\|Y - Y_{\hat{L}}\|_F / \|Y\|_F$ in approximating the label matrix $Y$. Global SVD denotes the error incurred by computing
the rank $\hat{L}$ SVD of $Y$. Local SVD computes rank $\hat{L}$ SVD of $Y$ within each cluster. XI NN objective denotes XI's objective
function. Global SVD incurs 90% error and the error is decreasing at most linearly as well. (b) shows the number of documents
in which each label is present for the WikiLSHTC data set. There are about 300K labels which are present in < 5 documents,
lending it a 'heavy tailed' distribution. (c) shows Precision@1 accuracy of XI and localLEML on the Wiki-10 data set as we vary
the number of clusters.

The main limitation of clustering is that it can be unstable in high dimensions. XI compensates by learning a 
small ensemble where each individual learner is generated by a different random clustering. This was empirically 
found to help tackle instabilities of clustering and significantly boost prediction accuracy with only linear increases in 
training and prediction time. For instance, on WikiLSHTC, Xl’s prediction accuracy was 56% with an 8 millisecond 
prediction time whereas LEML could only manage 20% accuracy while taking 300 milliseconds for prediction per 
test point. 

Recently, tree based methods have also become popular for extreme multi-label learning as they enjoy
significant accuracy gains over the existing embedding methods. For instance, FastXML can achieve a prediction
accuracy of 49% on WikiLSHTC using a 50 tree ensemble. However, XI is now able to extend embedding methods 
to outperform tree ensembles, achieving 49.8% with 2 learners and 55% with 10. Thus, by learning local distance 
preserving embeddings, XI can now obtain the best of both worlds. In particular, XI can achieve the highest prediction 
accuracies across all methods on even the most challenging data sets while retaining all the benefits of embeddings and 
eschewing the disadvantages of large tree ensembles such as large model size and lack of theoretical understanding. 

Our contributions in this paper are: First, we identify that the low-rank assumption made by most embedding 
methods is violated in the real world and that local distance preserving embeddings can offer a superior alternative. 
Second, we propose a novel formulation for learning such embeddings and show that it has sound theoretical properties.
In particular, we prove that XI consistently preserves nearest neighbours in the label space and hence learns good
quality embeddings. Third, we build an efficient pipeline for training and prediction which can be orders of magnitude 
faster than state-of-the-art embedding methods while being significantly more accurate as well. 

2 Method 

Let $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be the given training data set, $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$ be the input feature vector,
$y_i \in \mathcal{Y} \subseteq \{0,1\}^L$ be the corresponding label vector, and $y_{ij} = 1$ iff the $j$-th label is turned on for $x_i$. Let $X = [x_1, \ldots, x_n]$
be the data matrix and $Y = [y_1, \ldots, y_n]$ be the label matrix. Given $\mathcal{D}$, the goal is to learn a multi-label classifier
$f : \mathbb{R}^d \to \{0,1\}^L$ that accurately predicts the label vector for a given test point. Recall that in extreme multi-label
settings, $L$ is very large and is of the same order as $n$ and $d$, ruling out several standard approaches such as 1-vs-All.

We now present our algorithm XI which is designed primarily to scale efficiently for large $L$. Our algorithm is an
embedding-style algorithm, i.e., during training we map the label vectors $y_i$ to $\hat{L}$-dimensional vectors $z_i \in \mathbb{R}^{\hat{L}}$ and
learn a set of regressors $V \in \mathbb{R}^{\hat{L} \times d}$ s.t. $z_i \approx V x_i$, $\forall i$. During the test phase, for an unseen point $x$, we first compute
its embedding $V x$ and then perform kNN over the set $[V x_1, V x_2, \ldots, V x_n]$. To scale our algorithm, we perform a
clustering of all the training points and apply the above mentioned procedures in each of the clusters separately. Below,
we first discuss our method to compute the embeddings $z_i$ and the regressors $V$. Section 2.2 then discusses our
approach for scaling the method to large data sets.
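Algorithm 1 XI: Train Algorithm

Require: $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, embedding dimensionality: $\hat{L}$, no. of neighbors: $\bar{n}$, no. of clusters: $C$,
regularization parameters: $\lambda$, $\mu$, L1 smoothing parameter: $\rho$
 1: Partition $X$ into $Q^1, \ldots, Q^C$ using k-means
 2: for each partition $Q^j$ do
 3:   Form $\Omega$ using $\bar{n}$ nearest neighbors of each label vector $y_i \in Q^j$
 4:   $[U \;\; \Sigma] \leftarrow \mathrm{SVP}(P_\Omega(Y^{j\top} Y^j), \hat{L})$
 5:   $Z^j \leftarrow U \Sigma^{1/2}$
 6:   $V^j \leftarrow \mathrm{ADMM}(X^j, Z^j, \lambda, \mu, \rho)$
 7:   $Z^j = V^j X^j$
 8: end for
 9: Output: $\{(Q^1, V^1, Z^1), \ldots, (Q^C, V^C, Z^C)\}$


Algorithm 2 XI: Test Algorithm

Require: Test point: $x$, no. of NN: $\bar{n}$, no. of desired labels: $p$
 1: $Q_\tau$ : partition closest to $x$
 2: $z \leftarrow V^\tau x$
 3: $\mathcal{N}_z \leftarrow \bar{n}$ nearest neighbors of $z$ in $Z^\tau$
 4: $P_x \leftarrow$ empirical label dist. for points $\in \mathcal{N}_z$
 5: $y_{pred} \leftarrow \mathrm{Top}_p(P_x)$


Sub-routine 3 XI: SVP

Require: Observations: $G$, index set: $\Omega$, dimensionality: $\hat{L}$
 1: $M_1 := 0$, $\eta = 1$
 2: repeat
 3:   $\hat{M} \leftarrow M + \eta (G - P_\Omega(M))$
 4:   $[U \;\; \Sigma] \leftarrow \text{Top-EigenDecomp}(\hat{M}, \hat{L})$
 5:   $\Sigma_{ii} \leftarrow \max(0, \Sigma_{ii})$, $\forall i$
 6:   $M \leftarrow U \cdot \Sigma \cdot U^\top$
 7: until Convergence
 8: Output: $U$, $\Sigma$


Sub-routine 4 XI: ADMM

Require: Data Matrix: $X$, Embeddings: $Z$, Regularization Parameters: $\lambda$, $\mu$, Smoothing Parameter: $\rho$
 1: $\beta := 0$, $\alpha := 0$
 2: repeat
 3:   $Q \leftarrow (Z + \rho(\alpha - \beta)) X^\top$
 4:   $V \leftarrow Q (X X^\top (1 + \rho) + \lambda I)^{-1}$
 5:   $\alpha \leftarrow (V X + \beta)$
 6:   $\alpha_i = \mathrm{sign}(\alpha_i) \cdot \max(0, |\alpha_i| - \frac{\mu}{\rho})$, $\forall i$
 7:   $\beta \leftarrow \beta + V X - \alpha$
 8: until Convergence
 9: Output: $V$
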
2.1 Learning Embeddings

As mentioned earlier, our approach is motivated by the fact that a typical real-world data set tends to have a large
number of tail labels that ensure that the label matrix $Y$ cannot be well-approximated using a low-dimensional linear
subspace (see Figure 1). However, $Y$ can still be accurately modeled using a low-dimensional non-linear manifold.
That is, instead of preserving distances (or inner products) of a given label vector to all the training points, we attempt
to preserve the distance to only a few nearest neighbors. That is, we wish to find an $\hat{L}$-dimensional embedding matrix
$Z = [z_1, \ldots, z_n] \in \mathbb{R}^{\hat{L} \times n}$ which minimizes the following objective:

$$\min_{Z \in \mathbb{R}^{\hat{L} \times n}} \|P_\Omega(Y^\top Y) - P_\Omega(Z^\top Z)\|_F^2 + \lambda \|Z\|_1, \qquad (1)$$

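where the index set $\Omega$ denotes the set of neighbors that we wish to preserve, i.e., $(i, j) \in \Omega$ iff $j \in \mathcal{N}_i$. $\mathcal{N}_i$ denotes a
set of nearest neighbors of $i$. We select $\mathcal{N}_i = \arg\max_{S, |S| \le \alpha \cdot n} \sum_{j \in S} (y_i^\top y_j)$, which is the set of $\alpha \cdot n$ points with the
largest inner products with $y_i$. $P_\Omega : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ is defined as:

$$(P_\Omega(Y^\top Y))_{ij} = \begin{cases} \langle y_i, y_j \rangle, & \text{if } (i, j) \in \Omega, \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$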
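
The neighbourhood set $\Omega$ and the operator $P_\Omega$ can be sketched with dense arrays as follows. The function names are hypothetical, a fixed number `n_bar` of neighbours per point stands in for $\alpha \cdot n$, and excluding a point from its own neighbourhood is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def neighbour_mask(Y, n_bar):
    """Boolean n x n mask with mask[i, j] True iff j is among the n_bar points
    with the largest inner product <y_i, y_j>, excluding j == i (assumption)."""
    G = (Y.T @ Y).astype(float)
    n = G.shape[0]
    S = G.copy()
    np.fill_diagonal(S, -np.inf)                 # never select the point itself
    idx = np.argsort(-S, axis=1)[:, :n_bar]      # top-n_bar columns per row
    mask = np.zeros((n, n), dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    return mask

def project(M, mask):
    """P_Omega: keep the entries of M indexed by the mask, zero out the rest."""
    return np.where(mask, M, 0.0)

# Toy usage: 3 labels, 4 points (columns of Y are label vectors).
Y = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
mask = neighbour_mask(Y, n_bar=1)
P = project(Y.T @ Y, mask)
```

Note that the mask is generally not symmetric, since $j$ being a nearest neighbour of $i$ does not imply the converse.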


Also, we add $L_1$ regularization, $\|Z\|_1 = \sum_i \|z_i\|_1$, to the objective function to obtain sparse embeddings. Sparse
embeddings have three key advantages: a) they reduce prediction time, b) reduce the size of the model, and c) avoid
overfitting. Now, given the embeddings $Z = [z_1, \ldots, z_n] \in \mathbb{R}^{\hat{L} \times n}$, we wish to learn a multi-regression model to
predict the embeddings $Z$ using the input features. That is, we require that $Z \approx V X$ where $V \in \mathbb{R}^{\hat{L} \times d}$. Combining
the two formulations and adding an $L_2$-regularization for $V$, we get:

$$\min_{V \in \mathbb{R}^{\hat{L} \times d}} \|P_\Omega(Y^\top Y) - P_\Omega(X^\top V^\top V X)\|_F^2 + \lambda \|V\|_F^2 + \mu \|V X\|_1. \qquad (3)$$


Note that the above problem formulation is somewhat similar to a few existing methods for non-linear dimensionality
reduction that also seek to preserve distances to a few near neighbors. However, in contrast to our approach,
these methods do not have a direct out of sample generalization, do not scale well to large-scale data sets, and lack
rigorous generalization error bounds.

Optimization: We first note that optimizing (3) is a significant challenge as the objective function is non-convex
as well as non-differentiable. Furthermore, our goal is to perform optimization for data sets where $L, n, d \gg 100{,}000$.
To this end, we divide the optimization into two phases. We first learn embeddings $Z = [z_1, \ldots, z_n]$ and then learn
regressors $V$ in the second stage. That is, $Z$ is obtained by directly solving (1) but without the $L_1$ penalty term:

$$\min_{Z \in \mathbb{R}^{\hat{L} \times n}} \|P_\Omega(Y^\top Y) - P_\Omega(Z^\top Z)\|_F^2 \equiv \min_{M \succeq 0,\ \mathrm{rank}(M) \le \hat{L}} \|P_\Omega(Y^\top Y) - P_\Omega(M)\|_F^2, \qquad (4)$$


where $M = Z^\top Z$. Next, $V$ is obtained by solving the following problem:

$$\min_{V \in \mathbb{R}^{\hat{L} \times d}} \|Z - V X\|_F^2 + \lambda \|V\|_F^2 + \mu \|V X\|_1. \qquad (5)$$

Note that the $Z$ matrix obtained using (4) need not be sparse. However, we store and use $V X$ as our embeddings, so
that sparsity is still maintained.

Optimizing (4): Note that even the simplified problem (4) is an instance of the popular low-rank matrix completion
problem and is known to be NP-hard in general. The main challenge arises due to the non-convex rank constraint on
$M$. However, using the Singular Value Projection (SVP) method [20], a popular matrix completion method, we can
guarantee convergence to a local minimum.

SVP is a simple projected gradient descent method where the projection is onto the set of low-rank matrices. That
is, the $t$-th step update for SVP is given by:

$$M_{t+1} = P_{\hat{L}}(M_t + \eta P_\Omega(Y^\top Y - M_t)), \qquad (6)$$

where $M_t$ is the $t$-th step iterate, $\eta > 0$ is the step-size, and $P_{\hat{L}}(M)$ is the projection of $M$ onto the set of rank-$\hat{L}$
positive semi-definite (PSD) matrices. Note that while the set of rank-$\hat{L}$ PSD matrices is non-convex, we can
still project onto this set efficiently using the eigenvalue decomposition of $M$. That is, let $M = U_M \Lambda_M U_M^\top$ be the
eigenvalue decomposition of $M$. Then,

$$P_{\hat{L}}(M) = U_M(1:r) \cdot \Lambda_M(1:r) \cdot U_M(1:r)^\top,$$

where $r = \min(\hat{L}, \hat{L}_M^+)$ and $\hat{L}_M^+$ is the number of positive eigenvalues of $M$. $\Lambda_M(1:r)$ denotes the top-$r$ eigenvalues
of $M$ and $U_M(1:r)$ denotes the corresponding eigenvectors.

While the above update restricts the rank of all intermediate iterates $M_t$ to be at most $\hat{L}$, computing the rank-$\hat{L}$
eigenvalue decomposition can still be fairly expensive for large $n$. However, by using special structure in the update (6),
one can significantly reduce the eigenvalue decomposition's computational complexity as well. In general, the eigenvalue
decomposition can be computed in time $O(\hat{L} \zeta)$ where $\zeta$ is the time complexity of computing a matrix-vector product.
Now, for the SVP update (6), the matrix has the special structure $\hat{M} = M_t + \eta P_\Omega(Y^\top Y - M_t)$. Hence $\zeta = O(n \hat{L} + n \bar{n})$
where $\bar{n} = |\Omega| / n$ is the average number of neighbors preserved by XI. Hence, the per-iteration time complexity
reduces to $O(n \hat{L}^2 + n \hat{L} \bar{n})$ which is linear in $n$, assuming $\bar{n}$ is nearly constant.
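
The SVP iteration can be sketched for small dense problems as below. This is an illustrative simplification, not the scalable procedure described above: it uses a full `numpy.linalg.eigh` decomposition instead of a structured top-$\hat{L}$ eigensolver, symmetrises the iterate before decomposing it (the neighbour mask need not be symmetric), and runs for a fixed number of iterations `n_iter`. All function and parameter names are hypothetical.

```python
import numpy as np

def project_psd_rank(M, L_hat):
    """Top-L_hat eigenpairs of the symmetrised M, negative eigenvalues clipped."""
    w, U = np.linalg.eigh((M + M.T) / 2)       # eigenvalues in ascending order
    order = np.argsort(-w)[:L_hat]             # indices of the L_hat largest
    return U[:, order], np.maximum(w[order], 0.0)

def svp(G, mask, L_hat, eta=1.0, n_iter=50):
    """Projected gradient descent for the masked low-rank PSD fit.
    G holds the target inner products; mask marks the observed entries."""
    n = G.shape[0]
    M = np.zeros((n, n))
    U, w = np.zeros((n, L_hat)), np.zeros(L_hat)
    for _ in range(n_iter):
        M_hat = M + eta * np.where(mask, G - M, 0.0)   # gradient step on observed entries
        U, w = project_psd_rank(M_hat, L_hat)          # projection onto rank-L_hat PSD set
        M = (U * w) @ U.T
    return U, w

# Toy usage: a fully observed rank-3 PSD matrix is recovered.
rng = np.random.default_rng(0)
Z0 = rng.standard_normal((3, 8))
G = Z0.T @ Z0
U, w = svp(G, np.ones((8, 8), dtype=bool), L_hat=3, n_iter=5)
Z = (U * np.sqrt(w)).T                          # embeddings with Z^T Z = U diag(w) U^T
```

The embeddings are read off as $Z = (U \Sigma^{1/2})^\top$, so that $Z^\top Z$ equals the final iterate $M$.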

Optimizing (5): Problem (5) contains an $L_1$ term which makes the problem non-smooth. Moreover, as the $L_1$ term involves
both $V$ and $X$, we cannot directly apply the standard prox-function based algorithms. Instead, we use the ADMM
method to optimize (5). See Sub-routine 4 for the updates and [21] for a detailed derivation of the algorithm.
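
The ADMM updates can be sketched with dense arrays as follows. The splitting variable `alpha` stands for $V X$ and `beta` is the scaled dual variable. The soft-threshold level `mu / rho`, the fixed iteration count and the function names are assumptions of this sketch, and a practical implementation would exploit sparsity in $X$.

```python
import numpy as np

def soft_threshold(A, t):
    """Element-wise shrinkage: sign(a) * max(|a| - t, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def admm_regressors(X, Z, lam, mu, rho, n_iter=100):
    """Approximately minimise ||Z - V X||_F^2 + lam ||V||_F^2 + mu ||V X||_1."""
    d = X.shape[0]
    alpha = np.zeros_like(Z)
    beta = np.zeros_like(Z)
    # The matrix inverted in the V-update does not change across iterations.
    inv = np.linalg.inv((1.0 + rho) * (X @ X.T) + lam * np.eye(d))
    V = np.zeros((Z.shape[0], d))
    for _ in range(n_iter):
        Q = (Z + rho * (alpha - beta)) @ X.T
        V = Q @ inv                                         # ridge-like update for V
        alpha = soft_threshold(V @ X + beta, mu / rho)      # sparsify V X (assumed level)
        beta = beta + V @ X - alpha                         # dual update
    return V

# Toy usage on random data.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 30))
Z = rng.standard_normal((2, 30))
V = admm_regressors(X, Z, lam=0.1, mu=0.05, rho=1.0)
```

With `mu = 0` the shrinkage step is the identity and the iteration converges to the ordinary ridge-regression solution, which gives a convenient check of the implementation.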

Generalization Error Analysis: Let $\mathcal{P}$ be a fixed (but unknown) distribution over $\mathcal{X} \times \mathcal{Y}$. Let each training
point $(x_i, y_i) \in \mathcal{D}$ be sampled i.i.d. from $\mathcal{P}$. Then, the goal of our non-linear embedding method (3) is to learn
an embedding matrix $A = V^\top V$ that preserves nearest neighbors (in terms of label distance/intersection) of any
$(x, y) \sim \mathcal{P}$. The above requirements can be formulated as the following stochastic optimization problem:


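$$\min_{A \succeq 0,\ \mathrm{rank}(A) \le k} \mathcal{L}(A) = \mathbb{E}_{(x, y), (\tilde{x}, \tilde{y}) \sim \mathcal{P}} \, \ell(A; (x, y), (\tilde{x}, \tilde{y})), \qquad (7)$$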
where the loss function $\ell(A; (x, y), (\tilde{x}, \tilde{y})) = g(\langle \tilde{y}, y \rangle)(\langle \tilde{y}, y \rangle - \tilde{x}^\top A x)^2$, and $g(\langle \tilde{y}, y \rangle) = \mathbb{I}[\langle \tilde{y}, y \rangle \ge \tau]$, where
$\mathbb{I}[\cdot]$ is the indicator function. Hence, a loss is incurred only if $y$ and $\tilde{y}$ have a large inner product. For an appropriate
selection of the neighborhood selection operator $\Omega$, (3) indeed minimizes a regularized empirical estimate of the loss
function (7), i.e., it is a regularized ERM w.r.t. (7).

We now show that the optimal solution $\hat{A}$ to (3) indeed minimizes the loss (7) up to an additive approximation error.
The existing techniques for analyzing excess risk in stochastic optimization require the empirical loss function to be
decomposable over the training set, and as such do not apply to (3) which contains loss-terms with two training points.
Still, using techniques from the AUC maximization literature [22], we can provide interesting excess risk bounds for
Problem (7).

Theorem 1. With probability at least $1 - \delta$ over the sampling of the dataset $\mathcal{D}$, the solution $\hat{A}$ to the optimization
problem (3) satisfies

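$$\mathcal{L}(\hat{A}) \le \inf_{A^* \in \mathcal{A}} \Big\{ \mathcal{L}(A^*) + \underbrace{C \big(\bar{L}^2 + (r^2 + \|A^*\|_F^2) R^4\big) \sqrt{\tfrac{1}{n} \log \tfrac{1}{\delta}}}_{\text{E-Risk}(n)} \Big\},$$

where $\hat{A}$ is the minimizer of (3), $r = \bar{L} / \lambda$ and $\mathcal{A} := \{A \in \mathbb{R}^{d \times d} : A \succeq 0, \mathrm{rank}(A) \le \hat{L}\}$.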

See Appendix A for a proof of the result. Note that the generalization error bound is independent of both $d$ and
$L$, which is critical for extreme multi-label classification problems with large $d$, $L$. In fact, the error bound is only
dependent on $\bar{L} \ll L$, which is the average number of positive labels per data point. Moreover, our bound also
provides a way to compute the best regularization parameter $\lambda$ that minimizes the error bound. However, in practice, we
set $\lambda$ to be a fixed constant.

Theorem 1 only preserves the population neighbors of a test point. Theorem 7, given in Appendix A, extends
Theorem 1 to ensure that the neighbors in the training set are also preserved. We would also like to stress that our
excess risk bound is universal and hence holds even if $\hat{A}$ does not minimize (3), i.e., $\mathcal{L}(\hat{A}) \le \mathcal{L}(A^*) + \text{E-Risk}(n) +
(\hat{\mathcal{L}}(\hat{A}) - \hat{\mathcal{L}}(A^*))$, where E-Risk($n$) is given in Theorem 1.

2.2 Scaling to Large-scale Data sets 

For large-scale data sets, one might require the embedding dimension $\hat{L}$ to be fairly large (say a few hundreds) which
might make computing the updates (6) infeasible. Hence, to scale to such large data sets, XI clusters the given
datapoints into smaller local regions. Several text-based data sets indeed reveal that there exist small local regions in
the feature-space where the number of points as well as the number of labels is reasonably small. Hence, we can train
our embedding method over such local regions without significantly sacrificing overall accuracy.

We would like to stress that despite clustering datapoints in homogeneous regions, the label matrix of any given
cluster is still not close to low-rank. Hence, applying a state-of-the-art linear embedding method, such as LEML, to
each cluster is still significantly less accurate when compared to our method (see Figure 1). Naturally, one can cluster
the data set into an extremely large number of regions, so that eventually the label matrix is low-rank in each cluster.
However, increasing the number of clusters beyond a certain limit might decrease accuracy as the error incurred during
the cluster assignment phase itself might nullify the gain in accuracy due to better embeddings. Figure 1 illustrates
this phenomenon where increasing the number of clusters beyond a certain limit in fact decreases accuracy of LEML.

Algorithm 1 provides a pseudo-code of our training algorithm. We first cluster the datapoints into $C$ partitions.
Then, for each partition we learn a set of embeddings using Sub-routine 3 and then compute the regression parameters
$V^\tau$, $1 \le \tau \le C$, using Sub-routine 4. For a given test point $x$, we first find out the appropriate cluster $\tau$. Then, we find
the embedding $z = V^\tau x$. The label vector is then predicted using k-NN in the embedding space. See Algorithm 2 for
more details.
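
The test-time procedure can be sketched as a single function. The sketch assumes that cluster membership is decided by the nearest k-means centroid in Euclidean distance and that the per-cluster regressors, embeddings and label matrices are stored in Python lists; the function name and argument layout are hypothetical.

```python
import numpy as np

def predict_top_p(x, centroids, V_list, Z_list, Y_list, n_bar, p):
    """Return the indices of the p labels ranked highest for test point x."""
    # 1. Partition closest to x (nearest centroid, an assumption of this sketch).
    tau = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    # 2. Embed the test point with that partition's regressors.
    z = V_list[tau] @ x
    # 3. n_bar nearest neighbours of z among the partition's stored embeddings.
    dists = np.linalg.norm(Z_list[tau] - z[:, None], axis=0)
    nn = np.argsort(dists)[:n_bar]
    # 4. Empirical label distribution over the neighbours.
    P_x = Y_list[tau][:, nn].mean(axis=1)
    # 5. Top-p labels.
    return np.argsort(-P_x)[:p]

# Toy usage: two clusters in 2-D with identity regressors.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
V_list = [np.eye(2), np.eye(2)]
Z_list = [np.array([[0.0, 0.1, 0.2], [0.0, 0.1, 0.2]]),
          np.array([[10.0, 10.1], [10.0, 10.1]])]
Y_list = [np.array([[1, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float),
          np.array([[0, 0], [0, 0], [0, 1], [1, 1]], dtype=float)]
labels = predict_top_p(np.array([0.05, 0.05]), centroids, V_list, Z_list, Y_list,
                       n_bar=2, p=2)
```

Only the embeddings of one cluster are searched, which is what keeps the kNN step affordable at prediction time.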

Owing to the curse-of-dimensionality, clustering turns out to be quite unstable for data sets with large $d$ and in
many cases leads to some drop in prediction accuracy. To safeguard against such instability, we use an ensemble of
models generated using different sets of clusters. We use different initialization points in our clustering procedure
to obtain different sets of clusters. Our empirical results demonstrate that using such ensembles leads to a significant
increase in accuracy of XI (see Figure 2) and also leads to stable solutions with small variance (see Table 4).
Figure 2: Variation in Precision@1 accuracy with model size and the number of learners on large-scale data sets. Clearly, XI
achieves better accuracy than FastXML and LocalLEML-Ensemble at every point of the curve. For WikiLSHTC, XI with a single
learner is more accurate than LocalLEML-Ensemble with even 15 learners. Similarly, XI with 2 learners achieves more accuracy
than FastXML with 50 learners.


3 Experiments 

Experiments were carried out on some of the largest extreme multi-label benchmark data sets demonstrating that XI 
could achieve significantly higher prediction accuracies as compared to the state-of-the-art. It is also demonstrated 
that XI could be faster at training and prediction than leading embedding techniques such as LEML. 

Data sets: Experiments were carried out on multi-label data sets including Ads1M (1M labels), Amazon
(670K labels), WikiLSHTC (320K labels), DeliciousLarge (200K labels) and Wiki10 (30K labels). All the
data sets are publicly available except Ads1M which is proprietary and is included here to test the scaling capabilities
of XI.

Unfortunately, most of the existing embedding techniques do not scale to such large data sets. We therefore also
present comparisons on publicly available small data sets such as BibTeX [26], MediaMill [27], Delicious [28] and
EURLex [29]. Table 2 in the supplementary material lists the statistics of each of these data sets.

Baseline algorithms: This paper's primary focus is on comparing XI to state-of-the-art methods which can scale
to the large data sets such as embedding based LEML [9] and tree based FastXML [15] and LPSR. Naive Bayes
was used as the base classifier in LPSR as was done in [15]. Techniques such as CS [3], CPLST, ML-CSSP and
1-vs-All could only be trained on the small data sets given standard resources. Comparisons between XI and
such techniques are therefore presented in the supplementary material. The implementation for LEML and FastXML
was provided by the authors. We implemented the remaining algorithms and ensured that the published results could
be reproduced and were verified by the authors wherever possible.

Hyper-parameters: Most of XI's hyper-parameters were kept fixed including the number of clusters in a learner
($\lfloor N_{\mathrm{Train}} / 6000 \rfloor$), embedding dimension (100 for the small data sets and 50 for the large), number of learners in the
ensemble (15), and the parameters used for optimizing (3). The remaining two hyper-parameters, the k in kNN and
the number of neighbours considered during SVP, were both set by limited validation on a validation set.

The hyper-parameters for all the other algorithms were set using fine grained validation on each data set so as
to achieve the highest possible prediction accuracy for each method. In addition, all the embedding methods were
allowed a much larger embedding dimension (0.8L) than XI (100) to give them as much opportunity as possible to
outperform XI.

Evaluation Metric: Precision at k (P@k) has been widely adopted as the metric of choice for evaluating extreme
multi-label algorithms. This is motivated by real world application scenarios such as tagging and
recommendation. Formally, the precision at k for a predicted score vector $\hat{y} \in \mathbb{R}^L$ is the fraction of correct positive
predictions in the top k scores of $\hat{y}$.
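
The metric can be computed per test point as in the following minimal sketch, where `scores` is the predicted score vector and `y_true` the binary ground-truth label vector (both names are illustrative). Reported figures are then averages of this quantity over the test set.

```python
import numpy as np

def precision_at_k(scores, y_true, k):
    """Fraction of the k highest-scoring labels that are truly relevant."""
    top_k = np.argsort(-scores)[:k]
    return y_true[top_k].sum() / k

# Toy usage: 5 labels, 2 of which are relevant.
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])
y_true = np.array([1, 0, 0, 0, 1])
p1 = precision_at_k(scores, y_true, 1)
p3 = precision_at_k(scores, y_true, 3)
p5 = precision_at_k(scores, y_true, 5)
```

Because only the top of the ranking is scored, the metric rewards placing a few correct labels first rather than recovering the full label set.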

Results on large data sets with more than 100K labels: Table 1 compares XI's prediction accuracy, in terms
of P@k (k = {1, 3, 5}), to all the leading methods that could be trained on five such data sets. XI could improve over
the leading embedding method, LEML, by as much as 35% and 15% in terms of P@1 and P@5 on the WikiLSHTC
data set. Similarly, XI outperformed LEML by 27% and 22% in terms of P@1 and P@5 on the Amazon data set
which also has many tail labels. The gains on the other data sets are consistent, but smaller, as the tail label problem
is not so acute. XI could also outperform the leading tree method, FastXML, by 6% in terms of both P@1 and P@5
on WikiLSHTC and Wiki10 respectively. This demonstrates the superiority of XI's overall pipeline constructed using
local distance preserving embeddings followed by kNN classification.
Table 1: Precision Accuracies. (a) Large-scale data sets: Our proposed method XI is as much as 35% more accurate in terms of
P@1 and 22% in terms of P@5 than LEML, a leading embedding method. Other embedding based methods do not scale to the
large-scale data sets; we compare against them on small-scale data sets in Table 3. XI is also 6% more accurate (w.r.t. P@1 and
P@5) than FastXML, a state-of-the-art tree method. '-' indicates LEML could not be run with the standard resources. (b) Small-scale
data sets: XI consistently outperforms state of the art approaches. WSABIE, which also uses a kNN classifier on its embeddings, is
significantly less accurate than XI on all the data sets, showing the superiority of our embedding learning algorithm.

(a) Large-scale data sets

Data set          Metric   XI      LEML    FastXML   LPSR-NB
Wiki10            P@1      85.54   73.50   82.56     72.71
                  P@3      73.59   62.38   66.67     58.51
                  P@5      63.10   54.30   56.70     49.40
Delicious-Large   P@1      47.03   40.30   42.81     18.59
                  P@3      41.67   37.76   38.76     15.43
                  P@5      38.88   36.66   36.34     14.07
WikiLSHTC         P@1      55.57   19.82   49.35     27.91
                  P@3      33.84   11.43   32.69     16.04
                  P@5      24.07    8.39   24.03     11.57
Amazon            P@1      35.05    8.13   33.36     28.65
                  P@3      31.25    6.83   29.30     24.88
                  P@5      28.56    6.03   26.12     22.37
Ads-1M            P@1      21.84   -       23.11     17.08
                  P@3      14.30   -       13.86     11.38
                  P@5      11.01   -       10.12      8.83

(b) Small-scale data sets

Data set    Metric   XI      LEML    FastXML   WSABIE   OneVsAll
BibTex      P@1      65.57   62.53   63.73     54.77    61.83
            P@3      40.02   38.4    39.00     32.38    36.44
            P@5      29.30   28.21   28.54     23.98    26.46
Delicious   P@1      68.42   65.66   69.44     64.12    65.01
            P@3      61.83   60.54   63.62     58.13    58.90
            P@5      56.80   56.08   59.10     53.64    53.26
MediaMill   P@1      87.09   84.00   84.24     81.29    83.57
            P@3      72.44   67.19   67.39     64.74    65.50
            P@5      58.45   52.80   53.14     49.82    48.57
EurLEX      P@1      80.17   61.28   68.69     70.87    74.96
            P@3      65.39   48.66   57.73     56.62    62.92
            P@5      53.75   39.91   48.00     46.2     53.42



XI also has better scaling properties as compared to all other embedding methods. In particular, apart from
LEML, no other embedding approach could scale to the large data sets and even LEML could not scale to Ads1M
with a million labels. In contrast, a single XI learner could be learnt on WikiLSHTC in 4 hours on a single core and
already gave ~20% improvement in P@1 over LEML (see Figure 2 for the variation in accuracy vs XI learners). In
fact, XI's training time on WikiLSHTC was comparable to that of tree based FastXML. FastXML trains 50 trees in
13 hours on a single core to achieve a P@1 of 49.35% whereas XI could achieve 51.78% by training 3 learners in 12
hours. Similarly, XI's training time on Ads1M was 7 hours per learner on a single core.

XI's predictions could also be up to 300 times faster than LEML's. For instance, on WikiLSHTC, XI made
predictions in 8 milliseconds per test point as compared to LEML's 279. XI therefore brings the prediction time of
embedding methods much closer to that of tree based methods (FastXML took 0.5 milliseconds per test point on
WikiLSHTC) and within the acceptable limit of most real world applications.

Effect of clustering and multiple learners: As mentioned in the introduction, other embedding methods could 
also be extended by clustering the data and then learning a local embedding in each cluster. Ensembles could also 
be learnt from multiple such clusterings. We extend LEML in such a fashion, and refer to it as LocalLEML, by 
using exactly the same 300 clusters per learner in the ensemble as used in XI for a fair comparison. As can be seen 
in Figure 2, XI significantly outperforms LocalLEML, with a single XI learner being much more accurate than an 
ensemble of even 10 LocalLEML learners. Figure 2 also demonstrates that XI's ensemble can be much more accurate 
at prediction than the tree based FastXML ensemble (the same plot is also presented in the appendix, 
depicting the variation in accuracy with model size in RAM rather than the number of learners in the ensemble). The 
figure also demonstrates that very few XI learners need to be trained before accuracy starts saturating. Finally, Table 4 
shows that the variance in XI's prediction accuracy (w.r.t. different cluster initializations) is very small, indicating that 
the method is stable even though clustering is performed in more than a million dimensions. 
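
The cluster-then-learn-locally idea with an ensemble over several clusterings can be illustrated with a toy sketch. It is not the XI or LocalLEML implementation: it skips the embedding step entirely and simply runs kNN on raw features inside the cluster a test point is routed to. Every function name, the tiny two-label data set and the parameter choices are assumptions made for illustration only.

```python
import numpy as np


def kmeans(X, k, rng, iters=10):
    """Plain Lloyd's k-means; returns (centers, cluster index of each row of X)."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for c in range(k):
            members = X[assign == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    # Final assignment against the final centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)


def fit_learner(X, Y, k, seed):
    """One learner = one random clustering of the training points."""
    centers, assign = kmeans(X, k, np.random.default_rng(seed))
    return {"centers": centers, "assign": assign, "X": X, "Y": Y}


def predict_scores(learner, x, n_neighbors=3):
    """Route x to its nearest non-empty cluster and average its neighbours' labels."""
    nonempty = np.unique(learner["assign"])
    d2c = ((learner["centers"][nonempty] - x) ** 2).sum(axis=1)
    c = nonempty[d2c.argmin()]
    idx = np.flatnonzero(learner["assign"] == c)
    d2 = ((learner["X"][idx] - x) ** 2).sum(axis=1)
    nearest = idx[np.argsort(d2)[:n_neighbors]]
    return learner["Y"][nearest].mean(axis=0)


def ensemble_predict(learners, x, top=1):
    """Sum label scores over all learners and return the top-scoring labels."""
    scores = sum(predict_scores(learner, x) for learner in learners)
    return np.argsort(-scores)[:top]


# Toy data: two well separated blobs, each carrying its own label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(10.0, 0.5, size=(20, 2))])
Y = np.zeros((40, 2))
Y[:20, 0] = 1.0
Y[20:, 1] = 1.0

learners = [fit_learner(X, Y, k=2, seed=s) for s in range(3)]
pred_a = ensemble_predict(learners, np.array([0.2, -0.1]))
pred_b = ensemble_predict(learners, np.array([9.8, 10.1]))
print(pred_a, pred_b)
```

Each learner differs only in the random initialization of its clustering, which mirrors the role that different cluster initializations play in the stability experiment of Table 4.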

Results on small data sets: Table 3, in the appendix, compares the performance of XI to several popular methods 
including embeddings, trees, kNN and 1-vs-All SVMs. Even though the tail label problem is not acute on these data 
sets, and XI was restricted to a single learner, XI's predictions could be significantly more accurate than all the other 
methods (except on Delicious where XI was ranked second). For instance, XI could outperform the closest competitor 
on EurLex by 3% in terms of P@1. Particularly noteworthy is the observation that XI outperformed WSABIE, 
which performs kNN classification on linear embeddings, by as much as 10% on multiple data sets. This demonstrates 
the superiority of XI's local distance preserving embeddings over the traditional low-rank embeddings. 
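
All tables report precision at k (P@1, P@3, P@5). As a reminder of what is being measured, here is a minimal sketch assuming the standard definition: the fraction of the k highest-scored labels that are relevant for the test point. The function name and the toy score vector are illustrative only.

```python
import numpy as np


def precision_at_k(scores, y_true, k):
    """Fraction of the k highest-scored labels that are truly relevant."""
    top_k = np.argsort(-scores)[:k]
    return float(y_true[top_k].sum()) / k


# Toy example with five labels; labels 0, 3 and 4 are relevant.
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])
y_true = np.array([1, 0, 0, 1, 1])

p1 = precision_at_k(scores, y_true, 1)  # top label: 0
p3 = precision_at_k(scores, y_true, 3)  # top labels: 0, 2, 4
p5 = precision_at_k(scores, y_true, 5)  # all five labels
print(p1, p3, p5)
```

The reported numbers are these per-point values averaged over the test set and expressed as percentages.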
A Generalization Error Analysis 


To present our results, we first introduce some notation: for any embedding matrix $A$ and data set $\mathcal{D}$, let 

$$\hat{\mathcal{L}}(A;\mathcal{D}) := \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}\ell(A;(x_i,y_i),(x_j,y_j)),$$

$$\tilde{\mathcal{L}}(A;\mathcal{D}) := \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{(x,y)\sim\mathcal{P}}\,\ell(A;(x_i,y_i),(x,y)),$$

$$\mathcal{L}(A) := \mathbb{E}_{(x,y),(\tilde{x},\tilde{y})\sim\mathcal{P}}\,\ell(A;(x,y),(\tilde{x},\tilde{y})).$$

We assume, without loss of generality, that the data points are confined to a unit ball, i.e. $\|x\|_2 \le 1$ for all $x \in \mathcal{X}$. 
Also let $Q = C\cdot(\bar{L}(r+\bar{L}))$ where $\bar{L}$ is the average number of labels active in a data point, $r = \bar{L}/\lambda$, $\lambda$ and $\mu$ are the 
regularization constants used in Q, and $C$ is a universal constant. 

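Theorem 1. Assume that all data points are confined to a ball of radius $R$, i.e. $\|x\|_2 \le R$ for all $x \in \mathcal{X}$. Then with 
probability at least $1-\delta$ over the sampling of the data set $\mathcal{D}$, the solution $\hat{A}$ to the optimization problem Q satisfies 

$$\mathcal{L}(\hat{A}) \le \inf_{A^*\in\mathcal{A}}\left\{\mathcal{L}(A^*) + C\left(\bar{L}^2 + \left(r^2 + \|A^*\|_F^2\right)R^4\right)\sqrt{\frac{1}{n}\log\frac{1}{\delta}}\right\},$$

where $r = \bar{L}/\lambda$, and $C$ and $C'$ are universal constants. 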

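Proof. Our proof will proceed in the following steps. Let $A^*$ be the population minimizer of the objective in the 
statement of the theorem. 

1. Step 1 (Capacity bound): we will show that for some $r$, we have $\|\hat{A}\|_F \le r$. 
2. Step 2 (Uniform convergence): we will show that w.h.p., $\sup_{A\in\mathcal{A},\,\|A\|_F\le r}\left\{\mathcal{L}(A) - \hat{\mathcal{L}}(A;\mathcal{D})\right\} \le O\left(\sqrt{\frac{1}{n}\log\frac{1}{\delta}}\right)$. 
3. Step 3 (Point convergence): we will show that w.h.p., $\hat{\mathcal{L}}(A^*;\mathcal{D}) - \mathcal{L}(A^*) \le O\left(\sqrt{\frac{1}{n}\log\frac{1}{\delta}}\right)$. 

Having these results will allow us to prove the theorem in the following manner 

$$\mathcal{L}(\hat{A}) \le \hat{\mathcal{L}}(\hat{A};\mathcal{D}) + \sup_{\|A\|_F\le r}\left\{\mathcal{L}(A) - \hat{\mathcal{L}}(A;\mathcal{D})\right\} \le \hat{\mathcal{L}}(A^*;\mathcal{D}) + O\left(\sqrt{\frac{1}{n}\log\frac{1}{\delta}}\right) \le \mathcal{L}(A^*) + O\left(\sqrt{\frac{1}{n}\log\frac{1}{\delta}}\right),$$

where the second step follows from the fact that $\hat{A}$ is the empirical risk minimizer. 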

We will now prove these individual steps as separate lemmata, where we will also reveal the exact constants in 
these results. 

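Lemma 2 (Capacity bound). For the regularization parameters chosen for the loss function $\ell(\cdot)$, the following holds 
for the minimizer $\hat{A}$ of the optimization problem: 

$$\|\hat{A}\|_F \le \mathrm{Tr}(\hat{A}) \le \frac{1}{\lambda}\bar{L}.$$

Proof. Since $\hat{A}$ minimizes the objective, we have: 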

\\A\\f < Tr{A) < y—- Tin.ax(y*,yj) ■ 

A nin — 1) A ij 


□ 


The above result shows that we can, for future analysis, restrict our hypothesis space to 

$$\mathcal{A}(r) := \left\{A \in \mathcal{A} : \|A\|_F^2 \le r^2\right\},$$

where we set $r = \bar{L}/\lambda$. This will be used to prove the following result. 

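Lemma 3 (Uniform convergence). With probability at least $1-\delta$ over the choice of the data set $\mathcal{D}$, we have 

$$\mathcal{L}(\hat{A}) - \hat{\mathcal{L}}(\hat{A};\mathcal{D}) \le 6\left(rR^2 + \bar{L}\right)^2\sqrt{\frac{1}{n}\log\frac{1}{\delta}}.$$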

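Proof. For notational simplicity, we will denote a labeled sample as $z = (x,y)$. Given any two points $z, z' \in \mathcal{Z} = 
\mathcal{X}\times\mathcal{Y}$ and any $A \in \mathcal{A}(r)$, we will then write 

$$\ell(A;z,z') = g(\langle y,y'\rangle)\left(\langle y,y'\rangle - x^\top A x'\right)^2 + \lambda\,\mathrm{Tr}(A),$$

so that, for the training set $\mathcal{D} = \{z_1,\ldots,z_n\}$, we have 

$$\hat{\mathcal{L}}(A;\mathcal{D}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}\ell(A;z_i,z_j)$$

as well as 

$$\mathcal{L}(A) = \mathbb{E}_{z,z'\sim\mathcal{P}}\,\ell(A;z,z').$$

Note that we ignore the $\|Vx\|_1$ term in Q completely because it is a regularization term and it won't increase the 
excess risk. 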
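
To make the empirical risk concrete, the following sketch evaluates $\hat{\mathcal{L}}(A;\mathcal{D})$ for a small random data set, once in vectorized form and once with explicit loops over ordered pairs. The weighting function `g` is not specified in this appendix beyond taking values in $[0,1]$; the thresholded indicator used below, the threshold `tau`, and all function names are assumptions made purely for illustration.

```python
import numpy as np


def g(s, tau=1.0):
    # Hypothetical weighting: 1 when the label overlap s reaches tau, else 0,
    # so that 0 <= g <= 1 as the analysis requires.
    return (np.asarray(s) >= tau).astype(float)


def empirical_risk(A, X, Y, lam, tau=1.0):
    """Vectorized average of the pairwise loss over all ordered pairs i != j."""
    n = X.shape[0]
    S = Y @ Y.T          # S[i, j] = <y_i, y_j>
    B = X @ A @ X.T      # B[i, j] = x_i^T A x_j
    per_pair = g(S, tau) * (S - B) ** 2
    off_diagonal = per_pair.sum() - np.trace(per_pair)  # drop the i == j terms
    return off_diagonal / (n * (n - 1)) + lam * np.trace(A)


def empirical_risk_loops(A, X, Y, lam, tau=1.0):
    """Same quantity, written as the explicit double sum."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                s = float(Y[i] @ Y[j])
                total += float(g(s, tau)) * (s - X[i] @ A @ X[j]) ** 2 + lam * np.trace(A)
    return total / (n * (n - 1))


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Y = (rng.uniform(size=(6, 4)) < 0.5).astype(float)
V = rng.normal(size=(2, 3))
A = V.T @ V  # a positive semidefinite embedding matrix

risk_fast = empirical_risk(A, X, Y, lam=0.1)
risk_slow = empirical_risk_loops(A, X, Y, lam=0.1)
print(risk_fast, risk_slow)
```

Because every pair contributes the same $\lambda\,\mathrm{Tr}(A)$ term, averaging over pairs leaves a single $\lambda\,\mathrm{Tr}(A)$, which is why the vectorized version adds it once at the end.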
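Suppose we draw a fresh data set $\tilde{\mathcal{D}} = \{\tilde{z}_1,\ldots,\tilde{z}_n\} \sim \mathcal{P}$, then we have, by linearity of expectation, 

$$\mathbb{E}_{\tilde{\mathcal{D}}\sim\mathcal{P}}\,\hat{\mathcal{L}}(A;\tilde{\mathcal{D}}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}\mathbb{E}\,\ell(A;\tilde{z}_i,\tilde{z}_j) = \mathcal{L}(A).$$

Now notice that for any $A \in \mathcal{A}(r)$, suppose we perturb the data set $\mathcal{D}$ at the $i$-th location to get a perturbed data set 
$\mathcal{D}^i$, then the following holds 

$$\left|\hat{\mathcal{L}}(A;\mathcal{D}) - \hat{\mathcal{L}}(A;\mathcal{D}^i)\right| \le \frac{4\left(\bar{L}^2 + r^2R^4\right)}{n},$$

which allows us to bound the excess risk as follows 

$$\begin{aligned}
\mathcal{L}(A) - \hat{\mathcal{L}}(A;\mathcal{D}) &= \mathbb{E}_{\tilde{\mathcal{D}}}\,\hat{\mathcal{L}}(A;\tilde{\mathcal{D}}) - \hat{\mathcal{L}}(A;\mathcal{D}) \\
&\le \sup_{A\in\mathcal{A}(r)}\left\{\mathbb{E}_{\tilde{\mathcal{D}}}\,\hat{\mathcal{L}}(A;\tilde{\mathcal{D}}) - \hat{\mathcal{L}}(A;\mathcal{D})\right\} \\
&\le \mathbb{E}_{\mathcal{D}}\left[\sup_{A\in\mathcal{A}(r)}\left\{\mathbb{E}_{\tilde{\mathcal{D}}}\,\hat{\mathcal{L}}(A;\tilde{\mathcal{D}}) - \hat{\mathcal{L}}(A;\mathcal{D})\right\}\right] + 4\left(\bar{L}^2 + r^2R^4\right)\sqrt{\frac{1}{2n}\log\frac{1}{\delta}} \\
&\le \underbrace{\mathbb{E}_{\mathcal{D},\tilde{\mathcal{D}}\sim\mathcal{P}}\left[\sup_{A\in\mathcal{A}(r)}\left\{\hat{\mathcal{L}}(A;\tilde{\mathcal{D}}) - \hat{\mathcal{L}}(A;\mathcal{D})\right\}\right]}_{Q_n(\mathcal{A}(r))} + 4\left(\bar{L}^2 + r^2R^4\right)\sqrt{\frac{1}{2n}\log\frac{1}{\delta}},
\end{aligned}$$

where the third step follows from an application of McDiarmid's inequality and the last step follows from Jensen's 
inequality. We now bound the quantity $Q_n(\mathcal{A}(r))$ below. Let $\tilde{\ell}(A,z,z') := \ell(A,z,z') - \lambda\cdot\mathrm{Tr}(A)$. Then we have 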

$$\begin{aligned}
Q_n(\mathcal{A}(r)) &= \frac{1}{n(n-1)}\,\mathbb{E}_{z_i,\tilde{z}_i\sim\mathcal{P}}\left[\sup_{A\in\mathcal{A}(r)}\left\{\sum_{i=1}^{n}\sum_{j\neq i}\ell(A;\tilde{z}_i,\tilde{z}_j) - \ell(A;z_i,z_j)\right\}\right] \\
&\le \frac{2}{n}\,\mathbb{E}_{z_i,\tilde{z}_i\sim\mathcal{P}}\left[\sup_{A\in\mathcal{A}(r)}\left\{\sum_{i=1}^{n/2}\tilde{\ell}(A;\tilde{z}_i,\tilde{z}_{n/2+i}) - \tilde{\ell}(A;z_i,z_{n/2+i})\right\}\right] \\
&\le 2\cdot\frac{2}{n}\,\mathbb{E}_{z_i,\epsilon_i}\left[\sup_{A\in\mathcal{A}(r)}\left\{\sum_{i=1}^{n/2}\epsilon_i\,\tilde{\ell}(A;z_i,z_{n/2+i})\right\}\right] = 2\cdot\mathcal{R}_{n/2}(\tilde{\ell}\circ\mathcal{A}(r)),
\end{aligned}$$

where the last step uses a standard symmetrization argument with the introduction of the Rademacher variables $\epsilon_i \sim 
\{-1,+1\}$. The second step presents a stumbling block in the analysis since the interaction between the pairs of the points 
means that traditional symmetrization can no longer be done. Previous works analyzing such "pairwise" loss functions 
face similar problems. Consequently, this step uses a powerful alternate representation for U-statistics to simplify 
the expression. This technique is attributed to Hoeffding. This, along with the Hoeffding decomposition, are two of 
the most powerful techniques to deal with "coupled" random variables as we have in this situation. 

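Theorem 4. For any set of real valued functions $q_\tau : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ indexed by $\tau \in T$, if $X_1,\ldots,X_n$ are i.i.d. random 
variables then we have 

$$\mathbb{E}\left[\sup_{\tau\in T}\frac{2}{n(n-1)}\sum_{1\le i<j\le n}q_\tau(X_i,X_j)\right] \le \mathbb{E}\left[\sup_{\tau\in T}\frac{2}{n}\sum_{i=1}^{n/2}q_\tau(X_i,X_{n/2+i})\right].$$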
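
The decoupling result rests on the fact that a U-statistic can be written as an average, over all permutations of the sample, of sums of $n/2$ terms that involve disjoint pairs of points; since a supremum of an average is at most the average of the suprema, the inequality follows. The sketch below checks this representation numerically for a tiny sample. It uses the ordered-pair form $\frac{1}{n(n-1)}\sum_{i\neq j}$, which coincides with the $i<j$ form above for symmetric kernels; the kernel `q` and all names are arbitrary choices made for illustration.

```python
import itertools
import numpy as np


def u_statistic(q, xs):
    """Average of q over all ordered pairs (i, j) with i != j."""
    n = len(xs)
    total = sum(q(xs[i], xs[j]) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def decoupled_average(q, xs):
    """Average, over every permutation, of (2/n) * sum_i q(x_pi(i), x_pi(n/2 + i))."""
    n = len(xs)
    half = n // 2  # n is assumed to be even
    values = []
    for perm in itertools.permutations(range(n)):
        s = sum(q(xs[perm[i]], xs[perm[half + i]]) for i in range(half))
        values.append(s / half)
    return float(np.mean(values))


def q(a, b):
    # An arbitrary kernel; it does not need to be symmetric for the identity to hold.
    return (a - 2.0 * b) ** 2


rng = np.random.default_rng(0)
xs = rng.normal(size=6)  # 6 points, hence 720 permutations

u_value = u_statistic(q, xs)
decoupled_value = decoupled_average(q, xs)
print(u_value, decoupled_value)
```

Within each permutation the $n/2$ summands depend on disjoint data points, which is what makes ordinary symmetrization applicable afterwards.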
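Applying this decoupling result to the random variables $X_i = (z_i,\tilde{z}_i)$, the index set $\mathcal{A}(r)$ and functions $q_A(X_i,X_j) = 
\ell(A;\tilde{z}_i,\tilde{z}_j) - \ell(A;z_i,z_j) = \tilde{\ell}(A;\tilde{z}_i,\tilde{z}_j) - \tilde{\ell}(A;z_i,z_j)$ gives us the second step. We now concentrate on bounding the 
resulting Rademacher average term $\mathcal{R}_{n/2}(\tilde{\ell}\circ\mathcal{A}(r))$. We have 

$$\begin{aligned}
\mathcal{R}_{n/2}(\tilde{\ell}\circ\mathcal{A}(r)) &= \frac{2}{n}\,\mathbb{E}_{z_i,\epsilon_i}\left[\sup_{A\in\mathcal{A}(r)}\sum_{i=1}^{n/2}\epsilon_i\,\tilde{\ell}(A;z_i,z_{n/2+i})\right] \\
&= \frac{2}{n}\,\mathbb{E}_{z_i,\epsilon_i}\left[\sup_{A\in\mathcal{A}(r)}\left\{\sum_{i=1}^{n/2}\epsilon_i\,g(\langle y_i,y_{n/2+i}\rangle)\left(\langle y_i,y_{n/2+i}\rangle - x_i^\top A x_{n/2+i}\right)^2\right\}\right].
\end{aligned}$$

That is, 

$$\begin{aligned}
\mathcal{R}_{n/2}(\tilde{\ell}\circ\mathcal{A}(r)) \le{}& \underbrace{\frac{2}{n}\,\mathbb{E}_{z_i,\epsilon_i}\left[\sum_{i=1}^{n/2}\epsilon_i\,g(\langle y_i,y_{n/2+i}\rangle)\,\langle y_i,y_{n/2+i}\rangle^2\right]}_{(A)} \\
&+ \underbrace{\frac{2}{n}\,\mathbb{E}_{z_i,\epsilon_i}\left[\sup_{A\in\mathcal{A}(r)}\left\{\sum_{i=1}^{n/2}\epsilon_i\,g(\langle y_i,y_{n/2+i}\rangle)\left(x_i^\top A x_{n/2+i}\right)^2\right\}\right]}_{B_n(\tilde{\ell}\circ\mathcal{A}(r))} \\
&+ \underbrace{\frac{4}{n}\,\mathbb{E}_{z_i,\epsilon_i}\left[\sup_{A\in\mathcal{A}(r)}\left\{\sum_{i=1}^{n/2}\epsilon_i\,g(\langle y_i,y_{n/2+i}\rangle)\,\langle y_i,y_{n/2+i}\rangle\left(x_i^\top A x_{n/2+i}\right)\right\}\right]}_{C_n(\tilde{\ell}\circ\mathcal{A}(r))}.
\end{aligned}$$

Now since the random variables $\epsilon_i$ are zero mean and independent of $z_i$, we have $\mathbb{E}_{\epsilon_i \mid z_i,z_{n/2+i}}\,\epsilon_i = 0$, which we can use 
to show that $\mathbb{E}_{\epsilon_i \mid z_i,z_{n/2+i}}\,\epsilon_i\,g(\langle y_i,y_{n/2+i}\rangle)\,\langle y_i,y_{n/2+i}\rangle^2 = 0$, which gives us, by linearity of expectation, $(A) = 0$. 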

To bound the next two terms we use the following standard contraction inequality: 


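Theorem 5. Let $\mathcal{H}$ be a set of bounded real valued functions from some domain $\mathcal{X}$ and let $x_1,\ldots,x_n$ be arbitrary 
elements from $\mathcal{X}$. Furthermore, let $\phi_i : \mathbb{R}\to\mathbb{R}$, $i = 1,\ldots,n$ be $L$-Lipschitz functions such that $\phi_i(0) = 0$ for all $i$. 
Then we have 

$$\mathbb{E}\left[\sup_{h\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\,\phi_i(h(x_i))\right] \le L\,\mathbb{E}\left[\sup_{h\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\,h(x_i)\right].$$
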
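Now define 

$$\phi_i(w) = g(\langle y_i,y_{n/2+i}\rangle)\,w^2.$$

Clearly $\phi_i(0) = 0$ and $0 \le g(\langle y_i,y_{n/2+i}\rangle) \le 1$. Moreover, in our case $w = x^\top A x'$ for some $A \in \mathcal{A}(r)$ and 
$\|x\|, \|x'\| \le R$. Thus, the function $\phi_i(\cdot)$ is $rR^2$-Lipschitz. Note that here we exploit the fact that the contraction 
inequality is actually proven for the empirical Rademacher averages due to which we can take $g(\langle y_i,y_{n/2+i}\rangle)$ to be 
a constant dependent only on $i$. This allows us to bound the term $B_n(\tilde{\ell}\circ\mathcal{A}(r))$ as follows 

$$\begin{aligned}
B_n(\tilde{\ell}\circ\mathcal{A}(r)) &= \frac{2}{n}\,\mathbb{E}_{z_i,\epsilon_i}\left[\sup_{A\in\mathcal{A}(r)}\left\{\sum_{i=1}^{n/2}\epsilon_i\,g(\langle y_i,y_{n/2+i}\rangle)\left(x_i^\top A x_{n/2+i}\right)^2\right\}\right] \\
&\le rR^2\cdot\underbrace{\frac{2}{n}\,\mathbb{E}_{z_i,\epsilon_i}\left[\sup_{A\in\mathcal{A}(r)}\left\{\sum_{i=1}^{n/2}\epsilon_i\left(x_i^\top A x_{n/2+i}\right)\right\}\right]}_{\mathcal{R}_{n/2}(\mathcal{A}(r))} \\
&= rR^2\cdot\mathcal{R}_{n/2}(\mathcal{A}(r)).
\end{aligned}$$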
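Similarly, we can show that 

$$\begin{aligned}
C_n(\tilde{\ell}\circ\mathcal{A}(r)) &= \frac{4}{n}\,\mathbb{E}_{z_i,\epsilon_i}\left[\sup_{A\in\mathcal{A}(r)}\left\{\sum_{i=1}^{n/2}\epsilon_i\,g(\langle y_i,y_{n/2+i}\rangle)\,\langle y_i,y_{n/2+i}\rangle\left(x_i^\top A x_{n/2+i}\right)\right\}\right] \\
&\le \frac{4\bar{L}}{n}\,\mathbb{E}_{z_i,\epsilon_i}\left[\sup_{A\in\mathcal{A}(r)}\left\{\sum_{i=1}^{n/2}\epsilon_i\left(x_i^\top A x_{n/2+i}\right)\right\}\right] \\
&\le 2\bar{L}\cdot\mathcal{R}_{n/2}(\mathcal{A}(r)).
\end{aligned}$$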


Thus, we have 

$$\mathcal{R}_{n/2}(\tilde{\ell}\circ\mathcal{A}(r)) \le \left(rR^2 + 2\bar{L}\right)\cdot\mathcal{R}_{n/2}(\mathcal{A}(r)).$$

Now all that remains to be done is bound $\mathcal{R}_{n/2}(\mathcal{A}(r))$. This can be done by invoking standard bounds on Rademacher 
averages for regularized function classes. In particular, using the two stage proof technique outlined in [22], we can 
show that 

$$\mathcal{R}_{n/2}(\mathcal{A}(r)) \le rR^2\sqrt{\frac{2}{n}}.$$
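
A quick Monte Carlo check makes the Rademacher bound above plausible. Since $x^\top A x' = \langle A, x x'^\top\rangle$, the supremum over the Frobenius ball $\|A\|_F \le r$ equals $r$ times the Frobenius norm of $\sum_i \epsilon_i x_i x_i'^\top$, and Jensen's inequality then gives $rR^2/\sqrt{m}$ for $m$ pairs, i.e. $rR^2\sqrt{2/n}$ when $m = n/2$. The sketch below assumes the plain Frobenius ball (ignoring any further structure on $A$); the sampling scheme and all names are illustrative.

```python
import numpy as np


def rademacher_estimate(X, Xp, r, trials, rng):
    """Monte Carlo estimate of E_eps sup_{||A||_F <= r} (1/m) sum_i eps_i x_i^T A x'_i."""
    m = X.shape[0]
    values = np.empty(trials)
    for t in range(trials):
        eps = rng.choice([-1.0, 1.0], size=m)
        M = (X * eps[:, None]).T @ Xp           # sum_i eps_i x_i x'_i^T
        values[t] = r * np.linalg.norm(M) / m   # sup of <A, M> over the ball is r * ||M||_F
    return float(values.mean())


def sample_in_ball(m, d, R, rng):
    """Random points whose Euclidean norm is at most R."""
    G = rng.normal(size=(m, d))
    G /= np.linalg.norm(G, axis=1, keepdims=True)
    return G * rng.uniform(0.0, R, size=(m, 1))


rng = np.random.default_rng(0)
m, d, R, r = 50, 5, 2.0, 3.0   # m plays the role of n/2 in the text
X = sample_in_ball(m, d, R, rng)
Xp = sample_in_ball(m, d, R, rng)

estimate = rademacher_estimate(X, Xp, r, trials=2000, rng=rng)
bound = r * R ** 2 / np.sqrt(m)
print(estimate, bound)
```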

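Putting it all together gives us the following bound: with probability at least $1-\delta$, we have 

$$\mathcal{L}(\hat{A}) - \hat{\mathcal{L}}(\hat{A};\mathcal{D}) \le 2\left(rR^2 + 2\bar{L}\right)rR^2\sqrt{\frac{2}{n}} + 4\left(\bar{L}^2 + r^2R^4\right)\sqrt{\frac{1}{2n}\log\frac{1}{\delta}},$$

as claimed. □ 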

The final part shows pointwise convergence for the population risk minimizer. 

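Lemma 6 (Point convergence). With probability at least $1-\delta$ over the choice of the data set $\mathcal{D}$, we have 

$$\hat{\mathcal{L}}(A^*;\mathcal{D}) - \mathcal{L}(A^*) \le 4\left(\bar{L}^2 + \|A^*\|_F^2R^4\right)\sqrt{\frac{1}{2n}\log\frac{1}{\delta}},$$

where $A^*$ is the population minimizer of the objective in the theorem statement. 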

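Proof. We note that, as before, 

$$\mathbb{E}_{\mathcal{D}\sim\mathcal{P}}\,\hat{\mathcal{L}}(A^*;\mathcal{D}) = \mathcal{L}(A^*).$$

Let $\mathcal{D}$ be a realization of the sample and $\mathcal{D}^i$ be a perturbed data set where the $i$-th data point is arbitrarily perturbed. 
Then we have 

$$\left|\hat{\mathcal{L}}(A^*;\mathcal{D}) - \hat{\mathcal{L}}(A^*;\mathcal{D}^i)\right| \le \frac{4\left(\bar{L}^2 + \|A^*\|_F^2R^4\right)}{n}.$$

Thus, an application of McDiarmid's inequality shows us that with probability at least $1-\delta$, we have 

$$\hat{\mathcal{L}}(A^*;\mathcal{D}) - \mathcal{L}(A^*) = \hat{\mathcal{L}}(A^*;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}\,\hat{\mathcal{L}}(A^*;\mathcal{D}) \le 4\left(\bar{L}^2 + \|A^*\|_F^2R^4\right)\sqrt{\frac{1}{2n}\log\frac{1}{\delta}},$$

which proves the claim. □ 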
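
For completeness, the McDiarmid step can be spelled out as a short worked calculation. It assumes the standard one-sided form of the inequality for a function $f$ of $n$ independent points whose value changes by at most $c_i$ when the $i$-th point is replaced; here $f(\mathcal{D}) = \hat{\mathcal{L}}(A^*;\mathcal{D})$ and the shorthand $c$ is introduced only for this calculation.

$$\Pr\left[f(\mathcal{D}) - \mathbb{E}\,f(\mathcal{D}) \ge t\right] \le \exp\left(-\frac{2t^2}{\sum_{i=1}^{n}c_i^2}\right), \qquad c_i = \frac{c}{n}, \quad c := 4\left(\bar{L}^2 + \|A^*\|_F^2R^4\right).$$

Since $\sum_{i=1}^{n}c_i^2 = c^2/n$, setting the right-hand side equal to $\delta$ gives

$$\exp\left(-\frac{2nt^2}{c^2}\right) = \delta \quad\Longleftrightarrow\quad t = c\,\sqrt{\frac{1}{2n}\log\frac{1}{\delta}},$$

which is exactly the deviation term $4\left(\bar{L}^2 + \|A^*\|_F^2R^4\right)\sqrt{\frac{1}{2n}\log\frac{1}{\delta}}$. The same calculation with $\|A^*\|_F$ replaced by $r$ yields the McDiarmid term in the uniform convergence bound.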

Putting the three lemmata together as shown above concludes the proof of the theorem. □ 

Although the above result ensures that the embedding provided by A would preserve neighbors over the population, 
in practice, we are more interested in preserving the neighbors of test points among the training points, as they are 
used to predict the label vector. The following extension of our result shows that A indeed accomplishes this as well. 
Theorem 7. Assume that all data points are confined to a ball of radius $R$, i.e. $\|x\|_2 \le R$ for all $x \in \mathcal{X}$. Then with 
probability at least $1-\delta$ over the sampling of the data set $\mathcal{D}$, the solution $\hat{A}$ to the optimization problem Q ensures 
that 

$$\tilde{\mathcal{L}}(\hat{A};\mathcal{D}) \le \inf_{A^*\in\mathcal{A}}\left\{\tilde{\mathcal{L}}(A^*;\mathcal{D}) + C\left(\bar{L}^2 + \left(r^2 + \|A^*\|_F^2\right)R^4\right)\sqrt{\frac{1}{n}\log\frac{1}{\delta}}\right\},$$

where $r = \bar{L}/\lambda$ and $C$ is a universal constant. 

Note that the loss function $\tilde{\mathcal{L}}(A;\mathcal{D})$ exactly captures the notion of how well an embedding matrix $A$ can preserve 
the neighbors of an unseen point among the training points. 

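Proof. We first recall and rewrite the form of the loss function considered here. For any data set $\mathcal{D} = \{z_1,\ldots,z_n\}$, 
any $A \in \mathcal{A}$ and $z \in \mathcal{Z}$, let $\rho(A;z) := \mathbb{E}_{z'\sim\mathcal{P}}\,\ell(A;z,z')$. This allows us to write 

$$\tilde{\mathcal{L}}(A;\mathcal{D}) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{z'\sim\mathcal{P}}\,\ell(A;z_i,z') = \frac{1}{n}\sum_{i=1}^{n}\rho(A;z_i).$$

Also note that for any fixed $A$, we have 

$$\mathbb{E}_{\mathcal{D}\sim\mathcal{P}}\,\tilde{\mathcal{L}}(A;\mathcal{D}) = \mathcal{L}(A).$$

Now, given a perturbed data set $\mathcal{D}^i$, we have 

$$\left|\tilde{\mathcal{L}}(A;\mathcal{D}) - \tilde{\mathcal{L}}(A;\mathcal{D}^i)\right| \le \frac{4\left(\bar{L}^2 + r^2R^4\right)}{n}$$

as before. Since this problem does not have to take care of pairwise interactions between the data points (since 
the "other" data point is being taken expectations over), using standard Rademacher style analysis gives us, with 
probability at least $1-\delta$, 

$$\tilde{\mathcal{L}}(\hat{A};\mathcal{D}) - \mathcal{L}(\hat{A}) \le 2\left(rR^2 + 2\bar{L}\right)rR^2\sqrt{\frac{1}{n}} + 4\left(\bar{L}^2 + r^2R^4\right)\sqrt{\frac{1}{2n}\log\frac{1}{\delta}}.$$

A similar analysis also gives us with the same confidence 

$$\mathcal{L}(A^*) - \tilde{\mathcal{L}}(A^*;\mathcal{D}) \le 4\left(\bar{L}^2 + \|A^*\|_F^2R^4\right)\sqrt{\frac{1}{2n}\log\frac{1}{\delta}}.$$

However, an argument similar to that used in the proof of Theorem 1 shows us that 

$$\mathcal{L}(\hat{A}) \le \mathcal{L}(A^*) + C\left(\bar{L}^2 + \left(r^2 + \|A^*\|_F^2\right)R^4\right)\sqrt{\frac{1}{n}\log\frac{1}{\delta}}.$$

Combining the above inequalities yields the desired result. □ 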
B Experiments 


Table 2: Data set statistics: n and m are the number of training and test points respectively, d and L are the number 
of features and labels, respectively, and $\bar{d}$ and $\bar{L}$ (columns "avg. d" and "avg. L") are the average number of nonzero 
features and positive labels in an instance, respectively. 

Data set          d          L          n          m          avg. d    avg. L
MediaMill         120        101        30993      12914      120.00    4.38
BibTeX            1836       159        4880       2515       68.74     2.40
Delicious         500        983        12920      3185       18.17     19.03
EURLex            5000       3993       17413      1935       236.69    5.31
Wiki10            101938     30938      14146      6616       673.45    18.64
DeliciousLarge    782585     205443     196606     100095     301.17    75.54
WikiLSHTC         1617899    325056     1778351    587084     42.15     3.19
Amazon            135909     670091     490449     153025     75.68     5.45
Ads1M             164592     1082898    3917928    1563137    9.01      1.96



Table 3: Results on small scale data sets: comparison of precision accuracies of XI with competing baseline methods on small 
scale data sets. The results reported are average precision values along with standard deviations over 10 random train-test splits for 
each data set. XI outperforms all baseline methods on all data sets (except Delicious, where it is ranked 2nd after FastXML). 

BibTeX
Group         Method       P@1              P@3              P@5
Proposed      XI           65.57 ± 0.65     40.02 ± 0.39     29.30 ± 0.32
Embedding     LEML         62.53 ± 0.69     38.40 ± 0.47     28.21 ± 0.29
Embedding     WSABIE       54.77 ± 0.68     32.38 ± 0.26     23.98 ± 0.18
Embedding     CPLST        62.38 ± 0.42     37.83 ± 0.52     27.62 ± 0.28
Embedding     CS           58.87 ± 0.64     33.53 ± 0.44     23.72 ± 0.28
Embedding     ML-CSSP      44.98 ± 0.08     30.42 ± 2.37     23.53 ± 1.21
Tree Based    FastXML-1    37.62 ± 0.91     24.62 ± 0.68     21.92 ± 0.65
Tree Based    FastXML      63.73 ± 0.67     39.00 ± 0.57     28.54 ± 0.38
Tree Based    LPSR         62.09 ± 0.73     36.69 ± 0.49     26.58 ± 0.38
Other         OneVsAll     61.83 ± 0.77     36.44 ± 0.38     26.46 ± 0.26
Other         KNN          57.00 ± 0.85     36.32 ± 0.47     28.12 ± 0.39

Delicious
Group         Method       P@1              P@3              P@5
Proposed      XI           68.42 ± 0.53     61.83 ± 0.59     56.80 ± 0.54
Embedding     LEML         65.66 ± 0.97     60.54 ± 0.44     56.08 ± 0.56
Embedding     WSABIE       64.12 ± 0.77     58.13 ± 0.58     53.64 ± 0.55
Embedding     CPLST        65.31 ± 0.79     59.84 ± 0.5      55.31 ± 0.52
Embedding     CS           61.35 ± 0.77     56.45 ± 0.62     52.06 ± 0.58
Embedding     ML-CSSP      63.03 ± 1.10     56.26 ± 1.18     50.15 ± 1.57
Tree Based    FastXML-1    55.34 ± 0.92     50.69 ± 0.58     45.99 ± 0.37
Tree Based    FastXML      69.44 ± 0.58     63.62 ± 0.75     59.10 ± 0.65
Tree Based    LPSR         65.00 ± 0.77     58.97 ± 0.65     53.46 ± 0.46
Other         OneVsAll     65.01 ± 0.73     58.90 ± 0.60     53.26 ± 0.57
Other         KNN          64.95 ± 0.68     58.90 ± 0.70     54.12 ± 0.57

MediaMill
Group         Method       P@1              P@3              P@5
Proposed      XI           87.09 ± 0.33     72.44 ± 0.30     58.45 ± 0.34
Embedding     LEML         84.00 ± 0.30     67.19 ± 0.29     52.80 ± 0.17
Embedding     WSABIE       81.29 ± 1.70     64.74 ± 0.67     49.82 ± 0.71
Embedding     CPLST        83.34 ± 0.45     66.17 ± 0.39     51.45 ± 0.37
Embedding     CS           83.82 ± 0.36     67.31 ± 0.17     52.80 ± 0.18
Embedding     ML-CSSP      78.94 ± 10.1     60.93 ± 8.5      44.27 ± 4.8
Tree Based    FastXML-1    61.14 ± 0.49     53.37 ± 0.30     48.39 ± 0.19
Tree Based    FastXML      84.24 ± 0.27     67.39 ± 0.20     53.14 ± 0.18
Tree Based    LPSR         83.57 ± 0.26     65.78 ± 0.22     49.97 ± 0.48
Other         OneVsAll     83.57 ± 0.25     65.50 ± 0.23     48.57 ± 0.56
Other         KNN          83.46 ± 0.19     67.91 ± 0.23     54.24 ± 0.21

EurLex
Group         Method       P@1              P@3              P@5
Proposed      XI           80.17 ± 0.86     65.39 ± 0.88     53.75 ± 0.80
Embedding     LEML         61.28 ± 1.33     48.66 ± 0.74     39.91 ± 0.68
Embedding     WSABIE       70.87 ± 1.11     56.62 ± 0.67     46.20 ± 0.55
Embedding     CPLST        69.93 ± 0.90     56.18 ± 0.66     45.74 ± 0.42
Embedding     CS           60.18 ± 1.70     48.01 ± 1.90     38.46 ± 1.48
Embedding     ML-CSSP      56.84 ± 1.5      45.4 ± 0.94      35.84 ± 0.74
Tree Based    FastXML-1    49.18 ± 0.55     42.72 ± 0.51     37.35 ± 0.42
Tree Based    FastXML      68.69 ± 1.63     57.73 ± 1.58     48.00 ± 1.40
Tree Based    LPSR         73.01 ± 1.4      60.36 ± 0.56     50.46 ± 0.50
Other         OneVsAll     74.96 ± 1.04     62.92 ± 0.53     53.42 ± 0.37
Other         KNN          77.2 ± 0.79      61.46 ± 0.96     50.45 ± 0.64



Table 4: Stability of XI learners. We show mean precision values over 10 runs of XI on WikiLSHTC with varying number of 
learners. Each individual learner as well as ensemble of XI learners was found to be extremely stable, with standard deviation 
ranging from 0.16% on P@1 to 0.11% on P@5. 

# Learners    P@1                P@3                P@5
1             46.04 ± 0.1659     26.15 ± 0.1359     18.14 ± 0.1045
2             50.04 ± 0.0662     29.32 ± 0.0638     20.58 ± 0.0517
3             51.65 ± 0.074      30.70 ± 0.052      21.68 ± 0.0398
4             52.62 ± 0.0878     31.55 ± 0.067      22.36 ± 0.0501
5             53.28 ± 0.0379     32.14 ± 0.0351     22.85 ± 0.0179
6             53.63 ± 0.083      32.48 ± 0.0728     23.12 ± 0.0525
7             54.03 ± 0.0757     32.82 ± 0.0694     23.4 ± 0.0531
8             54.28 ± 0.0699     33.07 ± 0.0503     23.60 ± 0.0369
9             54.44 ± 0.048      33.24 ± 0.023      23.74 ± 0.0172
10            54.69 ± 0.035      33.45 ± 0.0127     23.92 ± 0.0115
[Figure panels (a), (b), (c), each titled "Ads-1m [L = 1.08M, d = 164K, n = 3.91M]"] 

Figure 3: Variation of precision accuracy with model size on Ads-1m data set 

[Figure panels (a), (b), (c), each titled "Amazon [L = 670K, d = 136K, n = 490K]"] 

Figure 4: Variation of precision accuracy with model size on Amazon data set 

[Figure panels (a), (b), (c), each titled "Delicious-Large [L = 205K, d = 782K, n = 196K]"] 

Figure 5: Variation of precision accuracy with model size on Delicious-Large data set 

[Figure panels titled "Wiki10 [L = 30K, d = 101K, n = 14K]"] 

Figure 6: Variation of precision accuracy with model size on Wiki10 data set 

[Figure panels (a), (b), (c), each titled "WikiLSHTC [L = 325K, d = 1.61M, n = 1.77M]"] 

Figure 7: Variation of precision accuracy with model size on WikiLSHTC data set 